Introduction: Simple linear regression models are essential tools to examine associations between variables. These models provide estimates of the rate of change in the response variable (i.e., y-axis) given a change in the explanatory variable (x-axis). Because many biological variables are continuous and usually follow a normal distribution, simple linear regression models are fundamental tools to describe the relationship between variables of interest.

In this module, you will continue to define variables of interest and estimate their descriptive statistics (M.1 and M.2) while learning to fit simple linear regression models to test associations between fertility rate and age in female rhesus macaques.



Upon completion of this module, you will be able to:


References:


Extra training:


Associated literature:


Expand for R Notation and functions index

Functions:




Testing associations between fertility rate and age in female rhesus macaques


Rhesus macaques are an informative comparative model due to several key demographic traits shared with many long-lived mammals, including humans. Long-lived mammals often show reproductive senescence (decreased in reproduction with age). However, the relationship of fertility rate as a function of age may be complex, showing periods of optimal (prime) reproduction, periods of stability, and other periods of decline. Describing these patterns is important to understand changes in mean fertility at the population-level, but also why some individual adult females (Fig 1) may not reproduce in a particular birth season.


**Fig 1**. Cayo Santiago rhesus macaque adult females with and infant. Photo by Raisa Hernández-Pacheco.

Fig 1. Cayo Santiago rhesus macaque adult females with and infant. Photo by Raisa Hernández-Pacheco.


The Cayo Santiago rhesus macaque longitudinal demographic data on all individuals provides the unique opportunity to study important demographic rates such as age-specific fertility across the entire lifespan. In this module, you will be using Cayo Santiago rhesus macaque female reproductive data from Luevano et al 2022 (cayo_fert, Table 1) to explore whether female fertility is a function of age using simple linear regression analysis. This is authentic data collected through daily visual censuses performed by the staff of the Caribbean Primate Research Center (CPRC) of the University of Puerto Rico-Medical Sciences Campus.

Before coding, explore the data in Table 1.


Table 1: Cayo Santiago female rhesus macaque data

Metadata of cayo_fert: Demographic data of Cayo Santiago female rhesus macaques from Luevano et al. 2022.



1. Checking data quality and generating reproductive variables

Before any analysis, check the data set and understand its attributes. Refer to Module M.1 for data structure exploration.


Guiding questions:

  1. Is there full information on all variables?

Click for Answer!

Yes!


  1. Why are there repeated measurements (i.e., same ID multiple times)?

Click for Answer!

Each female was followed across their entire life in the island (i.e., until they exited the population through death, culling, or right censorship). Thus, the data provides information of the reproductive performance of females per year of their life.


Because the data provides information on the reproductive performance of females per year of their life, you can estimate the mean age-specific fertility rate (number of offspring produced at age x divided by the total number of females of age x).

Below, you will learn how to! You will use tidyverse and rstatix to summarize age-specific fertility rate (refer to Module M.2).

# loading packages
library(tidyverse)
library(rstatix)

# summary table with age-specific fertility
fert <- cayo_fert %>%
  group_by(age) %>% # grouping by age
  get_summary_stats(offspring) #summary of number of offspring

fert


Guiding questions:

  1. Can you describe any pattern in mean fertility rate as females get older?

Click for Answer!

Fertility increases with age up to a maximum value at age ~9 and then decreases until age 24.


2. Visializing data


Challenge!

  1. Plot mean age-specific fertility versus age.


Guiding questions:

  1. Can you describe the relationship between mean fertility and age?

Click for Answer!

The relationship between mean fertility and age is nonlinear. A few females start reproducing at age 3, most females reproduce at ages 4 to 15, and then fertility decreases. After 24 years of age, no female reproduces.


  1. Do you think a linear model is an appropriate statistical tool to describe such association?

Click for Answer!

Yes. A linear model would describe the negative association between mean fertility and age.


3. Fitting simple linear regression models

Below, you will fit a simple linear regression to the rhesus macaque female fertility data. For this, you will use the command lm(), where the first argument is the response variable, followed by the explanatory variable (separated by ~). To see the model output, you will use summary(). Make sure the variables are of class numeric!

# linear model
lm1 <- lm(mean ~ age, data = fert)

# model output
summary(lm1)


Guiding questions:

  1. Can you explain all components of the summary output?

Click for Answer!

Call: the model formula defined.

Residuals: Minimum, 1st quartile, median, 3rd quartile, and maximum value of the model residuals. As a refresher, the residuals are the difference between the observed value of the response variable (y-axis) and the predicted value from the regression line.

Coefficients: Model parameters (y-intercept and slope) and statistics for each parameter (SE, t statistic, p-value).

Residual standard error: The standard deviation of the residuals (smaller residual SE means model predictions are better)

R-squared: Coefficient of determination and indicates the proportion of the variation in the dependent variable that is predictable from the independent variable.


  1. Can you define and interpret the linear regression?

Click for Answer!

y = -0.019259*age + 0.821764. Mean fertility at age 0 is predicted to be 0.822. Per an increase in 1 year of age, mean fertility decreases by 0.019.


4. Plotting predictions from simple linear regression models

There are several ways to plot model predictions in R. First, you will implement the function predict() to extract model predictions (values of the linear regression). Second, you will implement the function geom_smooth(methods = “lm”) from ggplot2 to automatically plot the predicted values given the empirical data.

# plotting empirical and predicted values
p1 <- ggplot(fert,aes(age,mean)) +
  geom_point() +
  geom_line(aes(age,predict(lm1))) #adding new data to geom_line()

p1

# using geom_smooth for model predictions
p2 <- ggplot(fert,aes(age,mean)) +
  geom_point() +
  geom_smooth(method = "lm") #with SE

p2


Guiding questions:

  1. Can you improve these plots? Try it!


5. Fitting simple linear regression models with quadratic effects

As expected for a long-lived mammal, the mean fertility decreases with age. However, you can argue that a straight line (single slope) is not the best model to describe the observed association. According to the data, there is a maximum mean fertility at an optimal age. Below, you will fit the same simple linear regression model but now adding a quadratic term for age using the function I().

# linear model with quadratic term
lm2 <- lm(mean ~ age + I(age^2), data = fert)

# model output
summary(lm2)


6. Plotting predictions from simple linear regression models with quadratic terms

As above, you can use the same functions to extract and plot model predictions.

# plotting empirical and predicted values
p3 <- ggplot(fert,aes(age,mean)) +
  geom_point() +
  geom_line(aes(age,predict(lm2))) #adding new data to geom_line()

p3

# using geom_smooth for model predictions
p4 <- ggplot(fert,aes(age,mean)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2)) #specifying the quadratic formula

p4


Challenge!

  1. Improve these plots (refer to Module M.3).

  2. Interpret model output and define the regression.


Question: Using your statistical knowledge and each model output, which model seems to best fit the data and why, lm1 or lm2?


FIN



Acknowledgements: The creation of this module was funded by the National Science Foundation DBI BRC-BIO award 2217812. Cayo Santiago is supported by the Office of Research Infrastructure Programs (ORIP) of the National Institutes of Health, grant 2 P40 OD012217, and the University of Puerto Rico (UPR), Medical Sciences Campus.


Authors: Raisa Hernández-Pacheco and Alexandra L. Bland, California State University, Long Beach